Author: Bright Kyeremeh (MrBriit)
Data Description:
The data contains features extracted from the silhouette of vehicles in different
angles. Four "Corgi" model vehicles were used for the experiment: a double
decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This
particular combination of vehicles was chosen with the expectation that the
bus, van and either one of the cars would be readily distinguishable, but it
would be more difficult to distinguish between the cars.
Domain:
Object recognition
Context:
The purpose is to classify a given silhouette as one of three types of vehicle,
using a set of features extracted from the silhouette. The vehicle may be viewed
from one of many different angles.
Attribute Information:
● All the features are geometric features extracted from the silhouette.
● All are numeric in nature.
Learning Outcomes:
● Exploratory Data Analysis
● Reduce the number of dimensions in the dataset with minimal information loss
● Train a model using Principal Components
Objective:
Apply a dimensionality reduction technique – PCA – and train a model using
principal components instead of training the model on just the raw data.
NB: In this notebook, we will import our libraries as and when we need them.
import pandas as pd #loading our dataset
df= pd.read_csv("vehicle-1.csv")
df.head() # view the first 5 rows
df.dtypes #checking the data types of each column
df.shape #the shape of the dataset
df.describe().T #the 5-number summary of the dataset
df.isna().sum() #checking for null values
#Defining dependent variables and the independent variables
#creating a copy in order to compare the two datasets (with and without missing values)
#newdf = df.copy()
X = df.iloc[:,0:18] #selecting the numerical attributes
y = df.iloc[:,18] #selecting class attribute.
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
transformed_values = imputer.fit_transform(X)
column = X.columns
df1 = pd.DataFrame(transformed_values, columns = column )
df1.isnull().sum()
#Distribution of the independent variables
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid') #renamed to 'seaborn-v0_8-whitegrid' in matplotlib >= 3.6
df1.hist(bins=20, figsize=(60,40), color='lightblue', edgecolor = 'red')
plt.show()
From the above plot, it can be seen that most of the attributes are approximately normally distributed, with a few skewed to the right or left.
#boxplot distribution of the independent variables
plt.figure(figsize= (30,20))
sns.boxplot(data=df1,orient="h")
scaled_variance.1 has a huge effect on our data distribution since it is measured on a much wider scale. We will therefore drop it from this view and visualise it alone, while visualising the rest of the dataset together.
#boxplot distribution of the independent variables without Scaled_variance_1
plt.figure(figsize= (30,20))
sns.boxplot(data=df1.drop('scaled_variance.1',axis=1),orient="h")
Now we can have a better look at our data and observe that many of the attributes contain outliers.
#Distribution of variables with most outliers
plt.figure(figsize= (20,15))
plt.subplot(1,2,1)
sns.boxplot(x= df1['max.length_aspect_ratio'], color='green')
plt.subplot(1,2,2)
sns.boxplot(x= df1['scaled_radius_of_gyration.1'], color='lightblue')
plt.show()
As can be seen, there are a lot of outliers present in max.length_aspect_ratio and scaled_radius_of_gyration.1.
from scipy.stats import iqr
Q1 = df1.quantile(0.25)
Q3 = df1.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
df2 = df1[~((df1 < (Q1 - 1.5 * IQR)) |(df1 > (Q3 + 1.5 * IQR))).any(axis=1)]
df2.shape
df1.shape
plt.figure(figsize= (20,15))
plt.subplot(8,8,1)
sns.boxplot(x= df2['pr.axis_aspect_ratio'], color='orange')
plt.subplot(8,8,2)
sns.boxplot(x= df2['skewness_about'], color='purple')
plt.subplot(8,8,3)
sns.boxplot(x= df2['scaled_variance'], color='brown')
plt.subplot(8,8,4)
sns.boxplot(x= df2['radius_ratio'], color='red')
plt.subplot(8,8,5)
sns.boxplot(x= df2['scaled_radius_of_gyration.1'], color='lightblue')
plt.subplot(8,8,6)
sns.boxplot(x= df2['scaled_variance.1'], color='yellow')
plt.subplot(8,8,7)
sns.boxplot(x= df2['max.length_aspect_ratio'], color='lightblue')
plt.subplot(8,8,8)
sns.boxplot(x= df2['skewness_about.1'], color='pink')
plt.show()
We can clearly see that the outliers have been removed. We could have skipped this step if the outliers were too many and our dataset were also large.
#Counts of our dependent variable
print(y.value_counts())
#splitscaledf = df1.copy()
sns.countplot(x=y)
plt.show()
df1.corr()
#Heatmap of the correlation between the independent attributes
plt.figure(figsize=(30,15))
sns.heatmap(df1.corr(), vmax=1, square=True,annot=True,cmap='viridis')
plt.title('Correlation between different attributes')
plt.show()
- pr.axis_rectangularity and scaled_variance.1 are very highly correlated, with a value of 0.99
- scatter_ratio and pr.axis_rectangularity are very highly correlated, with a value of 0.99
- max.length_rectangularity and circularity are also very highly correlated, with a value of 0.96
among other feature pairs.
However, there are some features with very low or even negative correlation, such as:
- skewness_about.2 and circularity, with a value of -0.1
- scaled_radius_of_gyration.1 and radius_ratio, with a value of -0.18
Other relationships can also be clearly seen from the heatmap.
#Pairplot of the correlation/distribution between various independent attributes
sns.pairplot(df1, diag_kind="kde")
The pairplot above validates the insights derived from our earlier heatmap. scaled_variance and scaled_variance.1 seem to have a very strong positive correlation, with a value of 0.95. skewness_about.2 and hollows_ratio also seem to have a strong positive correlation, with a coefficient of 0.89.
scatter_ratio and elongatedness seem to have a very strong negative correlation. elongatedness and pr.axis_rectangularity seem to have a strong negative correlation, with a value of -0.97.
From the pairplot analysis we found that scaled_variance & scaled_variance.1, and elongatedness & pr.axis_rectangularity, are strongly correlated, so they need to be dropped or treated carefully before we go on to model building.
With our objective of predicting an object to be a van, bus or car based on some input features, ideally we assume that there is little or no multicollinearity between the features. If instead our data contains features that are highly correlated, then we encounter what is known as multicollinearity.
Multicollinearity can lead to misleading results. It occurs when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of accuracy.
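One common way to quantify multicollinearity (not used in this notebook; a hedged illustration on synthetic data only) is the variance inflation factor: the VIF of a feature is 1/(1 - R²) for the regression of that feature on all the others, which equals the corresponding diagonal element of the inverse correlation matrix. A minimal sketch, where `vif_scores` and `X_demo` are names invented for this example:

```python
import numpy as np
import pandas as pd

def vif_scores(X: pd.DataFrame) -> pd.Series:
    # The VIF of each feature is the corresponding diagonal element
    # of the inverse of the correlation matrix.
    corr = X.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=X.columns)

# Synthetic example: x2 is almost a copy of x1, so both get a large VIF.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)   # highly collinear with x1
x3 = rng.normal(size=200)                    # independent of the others
X_demo = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
print(vif_scores(X_demo))
```

A VIF above roughly 5 to 10 is usually read as a sign of problematic collinearity.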
It is quite reasonable to drop one feature if two features in our dataset are highly correlated, since there is no point in using both. From the above heatmap as well as the pairplot, we recognised that there are many features that are highly correlated, negatively or positively, with values as high as 0.99 and -0.97.
These features are listed below:
max.length_rectangularity
scaled_radius_of_gyration
skewness_about.2
scatter_ratio
elongatedness
pr.axis_rectangularity
scaled_variance
scaled_variance.1
As pointed out earlier, the easiest way to deal with multicollinearity is to delete one of the highly correlated features. However, we will use a better approach known as dimensionality reduction, specifically Principal Component Analysis (PCA).
Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables. These combinations are done in such a way that the new variables (principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components.
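The description above can be sketched directly with NumPy: the eigenvectors of the covariance matrix give the component directions, the eigenvalues give their variances, and projecting the centred data onto the eigenvectors yields uncorrelated new variables. A minimal illustration on synthetic two-feature data (names like `scores` are invented for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two correlated features.
x = rng.normal(size=500)
data = np.column_stack([x, 0.8 * x + rng.normal(scale=0.3, size=500)])
data = data - data.mean(axis=0)              # centre the data

cov = np.cov(data, rowvar=False)             # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues come out ascending
order = np.argsort(eigvals)[::-1]            # reorder: largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = data @ eigvecs                      # project onto the components
# The projected variables are uncorrelated: off-diagonal covariance ~ 0,
# and the diagonal recovers the eigenvalues.
print(np.cov(scores, rowvar=False).round(6))
```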
#printing the shape of the dependent and independent attributes
print("shape of Independent attributes:",df1.shape)
print("shape of Dependent attributes:",y.shape)
from scipy.stats import zscore
XScaled=df1.apply(zscore)
XScaled.head()
#Alternatively
# from sklearn.preprocessing import StandardScaler
# #We transform (centralize) the entire X (independent variable data) to normalize it using standardscalar through transformation. We will create the PCA dimensions
# # on this distribution.
# sdsc = StandardScaler()
# X_std = sdsc.fit_transform(df1)
#Getting the covariance matrix
covMatrix = np.cov(XScaled,rowvar=False)
print(covMatrix)
covMatrix.shape #shape of the covariance matrix
#Performing PCA on all the 18 components
from sklearn.decomposition import PCA
pca = PCA(n_components=18)
pca.fit(XScaled)
The eigenvalues:
print(pca.explained_variance_)
The eigenvectors:
print(pca.components_)
print(pca.explained_variance_ratio_) #the proportion of total variance explained by each component. This is visualised in the next cell below
#visualisation of the explained variance in the Eigen vectors
plt.bar(list(range(1,19)),pca.explained_variance_ratio_,alpha=0.5, align='center')
plt.ylabel('Variation explained')
plt.xlabel('eigen Value')
plt.show()
#Elbow visualisation of variance in the Eigen vectors
plt.step(list(range(1,19)),np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Cum of variation explained')
plt.xlabel('eigen Value')
plt.show()
Now 8 dimensions seem very reasonable. With 8 components we can explain over 95% of the variation in the original data!
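As a side note, scikit-learn can pick this number for us: passing a float between 0 and 1 as `n_components` keeps the smallest number of components whose cumulative explained variance reaches that fraction. A hedged sketch on synthetic standardised data (not the vehicle dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 4))
# 8 features built from 4 underlying signals, so ~4 components suffice.
X_demo = np.hstack([base, base + rng.normal(scale=0.1, size=(300, 4))])
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

pca95 = PCA(n_components=0.95)   # keep enough components for 95% of variance
pca95.fit(X_demo)
print(pca95.n_components_, pca95.explained_variance_ratio_.sum())
```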
#Using only 8 Eigen vectors that can be explained instead of the 18 vectors
pca3 = PCA(n_components=8)
pca3.fit(XScaled)
print(pca3.components_)
print(pca3.explained_variance_ratio_)
Xpca3 = pca3.transform(XScaled)
Xpca3 #now we have only 8 variables instead of 18
#pairplot of the new variables showing no correlation
sns.pairplot(pd.DataFrame(Xpca3))
Let's construct two models. The first with all 18 original independent variables and the second with only the 8 new variables constructed using PCA.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
y.shape
# from sklearn.linear_model import LinearRegression
# regression_model = LinearRegression()
# regression_model.fit(XScaled, y)
# regression_model.score(XScaled, y)
# regression_model_pca = LinearRegression()
# regression_model_pca.fit(Xpca3, y)
# regression_model_pca.score(Xpca3, y)
from sklearn.model_selection import KFold, train_test_split
X_train, X_test, y_train, y_test = train_test_split(df1, y, test_size = 0.1942313295, random_state = 14)
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
import warnings
warnings.filterwarnings('ignore')
from sklearn.svm import SVC
svc = SVC()
#svc.fit(X_train,y_train)
#Orig_y_predict = svc.predict(X_test) #predict on test data
#svc.score(X_test, y_test)
#now split the data into 70:30 ratio
#orginal Data
Orig_X_train,Orig_X_test,Orig_y_train,Orig_y_test = train_test_split(XScaled,y,test_size=0.30,random_state=1)
#PCA Data
pca_X_train,pca_X_test,pca_y_train,pca_y_test = train_test_split(Xpca3,y,test_size=0.30,random_state=1)
svc.fit(Orig_X_train,Orig_y_train) #SVC on original data
Orig_y_predict = svc.predict(Orig_X_test) #Prediction on original dataset
#now fit the model on pca data with new dimension
svc1 = SVC() #instantiate the object
svc1.fit(pca_X_train,pca_y_train)
#predict the y value
pca_y_predict = svc1.predict(pca_X_test) #Prediction on pca test dataset
#display accuracy score of both models
from sklearn.metrics import accuracy_score,confusion_matrix, classification_report,roc_auc_score
print("Model Score On Original Data ",svc.score(Orig_X_test, Orig_y_test))
print("Model Score On Reduced PCA Dimension ",svc1.score(pca_X_test, pca_y_test))
print("-------"*10)
print("Before PCA On Original 18 Dimension",accuracy_score(Orig_y_test,Orig_y_predict))
print("After PCA(On 8 dimension)",accuracy_score(pca_y_test,pca_y_predict))
Our support vector classifier without PCA has an accuracy score of 95% on the test set.
The SVC model on the PCA components (reduced dimensions) has an accuracy score of 93%.
By reducing dimensionality from 18 to 8 components, we only dropped around 2% in test accuracy. That small drop seems easy to justify given the reduction in variables, and the reduced model is also less likely to over-fit.
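The comparison above rests on a single 70:30 split, so the scores carry some luck of the draw. A hedged sketch of how the same comparison could be run with 10-fold cross-validation, using synthetic stand-in data so the snippet is self-contained (`X_demo`, `y_demo` and the pipeline names are invented for this example):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the scaled vehicle features: 18 features, 3 classes.
X_demo, y_demo = make_classification(n_samples=600, n_features=18,
                                     n_informative=8, n_classes=3,
                                     random_state=0)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
full = make_pipeline(StandardScaler(), SVC())
reduced = make_pipeline(StandardScaler(), PCA(n_components=8), SVC())

# Mean accuracy over 10 folds for each model.
print("full:   ", cross_val_score(full, X_demo, y_demo, cv=cv).mean())
print("reduced:", cross_val_score(reduced, X_demo, y_demo, cv=cv).mean())
```

On the actual notebook data, `XScaled` and `y` would take the place of `X_demo` and `y_demo`.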
# Calculate the confusion matrix and plot it as a heatmap
def draw_confmatrix(y_test, yhat, str1, str2, str3, datatype):
    cm = confusion_matrix(y_test, yhat, labels=[0, 1, 2])
    print("Confusion Matrix For:", "\n", datatype, cm)
    sns.heatmap(cm, annot=True, fmt='d', xticklabels=[str1, str2, str3], yticklabels=[str1, str2, str3])
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
draw_confmatrix(Orig_y_test, Orig_y_predict,"Van ", "Car ", "Bus", "Original Data Set" )
draw_confmatrix(pca_y_test, pca_y_predict,"Van ", "Car ", "Bus", "Reduced Dimensions Using PCA ")
#Classification Report Of Model built on Raw Data
print("Classification Report For Raw Data:", "\n", classification_report(Orig_y_test,Orig_y_predict))
#Classification Report Of Model built on Principal Components:
print("Classification Report For PCA:","\n", classification_report(pca_y_test,pca_y_predict))
Confusion Matrix On Original Data:
Our model on the original dataset has correctly classified 58 out of 59 actual vans, with only 1 wrongly predicted as a car.
Our model has correctly classified 129 cars, and has wrongly classified 3 cars as buses and 1 car as a van.
Again, out of 62 actual buses, our model has correctly classified 56 buses; it faltered by wrongly classifying 6 buses as vans and 1 bus as a car.
Confusion Matrix On Reduced Dimensions After PCA:
Out of 133 actual cars, our model has correctly classified 126 as cars and faltered in 7 cases, wrongly classifying 5 cars as buses and 2 cars as vans.
Out of 62 actual buses, our model has correctly classified 54 as buses; it faltered in 8 cases, wrongly classifying 7 buses as cars and 1 bus as a van.
Insights On Classification Reports:
On original data:
Our model has a 99% precision score when classifying a car from the given set of silhouette parameters. It has 89% precision when classifying the input as a van, and 93% precision when predicting a bus.
In terms of recall, our model scores 98% for van classification, 97% for car and 89% for bus.
On Reduced Dimensions After PCA: